feat: unify llm-judge and agent-judge, add agentv provider#617

Merged
christso merged 13 commits into main from feat/unify-judge-types-614 on Mar 15, 2026

@christso christso commented Mar 15, 2026

Summary

Closes #614

  • Unified judge types: Absorbed agent-judge into llm-judge with auto-detection. llm-judge now supports three modes:
    • LLM mode (default): Structured JSON evaluation via generateObject
    • Built-in agent mode: When judge provider is agentv, uses AI SDK generateText with sandboxed filesystem tools
    • Delegate mode: When judge provider is an agent provider (claude-cli, codex, etc.), sends evaluation prompt via provider.invoke()
  • New agentv provider: Built-in AI SDK provider that parses provider:model strings (e.g., openai:gpt-5-mini) and creates LanguageModel instances via direct SDK calls. Supports openai, anthropic, azure, google.
  • CLI flags: Added --judge-target and --model flags to agentv eval for overriding judge provider across all evaluators
  • Hard removal: Removed agent-judge entirely — no backward compat, no YAML remapping, as if it never existed. Only llm-judge remains.
  • Code review fixes: Replaced cliModel! non-null assertion with explicit guard, consolidated duplicate delegate methods, added try-catch for RegExp in search_files tool
  • CLAUDE.md: Added "Completing Work — E2E Checklist" section requiring e2e verification for all work before finishing
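
The auto-detection described in the first bullet can be pictured roughly as follows. This is a minimal sketch, not the code in evaluators/llm-judge.ts: the `detectJudgeMode` function, the `JudgeMode` type, and the contents of the provider set are assumptions made for illustration (the PR only names claude-cli, codex, and copilot as example agent providers).

```typescript
// Hypothetical sketch: names and the provider list are illustrative,
// not taken from the PR's implementation.
type JudgeMode = "llm" | "built-in" | "delegate";

// Assumed membership; the PR mentions claude-cli, codex and copilot as examples.
const AGENT_PROVIDER_KINDS: ReadonlySet<string> = new Set([
  "claude-cli",
  "codex",
  "copilot",
]);

function detectJudgeMode(providerKind: string): JudgeMode {
  // agentv is handled separately because it is not listed as an agent provider.
  if (providerKind === "agentv") return "built-in";
  // Agent providers receive the evaluation prompt via provider.invoke().
  if (AGENT_PROVIDER_KINDS.has(providerKind)) return "delegate";
  // Everything else falls back to structured JSON evaluation.
  return "llm";
}

console.log(detectJudgeMode("azure"));      // "llm"
console.log(detectJudgeMode("agentv"));     // "built-in"
console.log(detectJudgeMode("claude-cli")); // "delegate"
```

Checking agentv before the agent-provider set keeps the built-in mode distinct from delegation even if the set grows later.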

Key changes

| File | Change |
| --- | --- |
| providers/agentv-provider.ts | New: parses model strings, creates AI SDK LanguageModel |
| evaluators/llm-judge.ts | Major: absorbed agent-judge logic, three evaluation modes |
| evaluators/agent-judge.ts | Deleted |
| types.ts | Removed AgentJudgeEvaluatorConfig, added max_steps/temperature to LlmJudgeEvaluatorConfig |
| registry/builtin-evaluators.ts | Removed agentJudgeFactory, updated llmJudgeFactory |
| loaders/evaluator-parser.ts | Removed agent-judge backward compat |
| loaders/eval-yaml-transpiler.ts | Unified NL conversion, only llm-judge cases |
| validation/eval-file.schema.ts | Removed AgentJudgeSchema entirely |
| orchestrator.ts | resolveJudgeProvider handles --judge-target override |
| apps/cli/.../run.ts | --judge-target and --model CLI flags |
| examples/features/agent-judge/ | Deleted entirely |
| CLAUDE.md | Added E2E checklist section |

Test plan

  • 10 new tests for AgentvProvider (construction, model parsing, all 4 providers, error cases)
  • 3 new tests for agentv target resolution
  • 1 new test for llm-judge rubrics in transpiler NL conversion
  • All 170 tests pass across key test files (agentv-provider, targets, evaluator-parser, transpiler)
  • Full bun test suite: 1094 tests, 42 pass in evaluators.test.ts
  • tsc --noEmit: zero real type errors (only stale TS6305 output warnings)
  • E2E: agentv eval with standard llm-judge (rubric example) — score=1
  • E2E: agentv eval with rubric evaluator — score=1
  • E2E: --judge-target agentv without --model — proper error message
  • E2E: --judge-target agentv --model google:gemini-2.5-flash with workspace verification — score=1, mode=built-in, steps=3, tool_calls=2

🤖 Generated with Claude Code

christso and others added 9 commits March 15, 2026 12:39
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace createProviderRegistry with direct createOpenAI/createAnthropic/
createAzure/createGoogleGenerativeAI calls to resolve v2/v3 spec version
type compatibility issues. Parse "provider:model" strings manually via a
switch statement. Simplify test mocks and add coverage for google, azure,
and error cases.
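
As an illustration of the manual "provider:model" parsing described above, here is a minimal sketch. The `parseModelString` function and `ParsedModel` type are hypothetical names, and the error messages are invented for the example. To stay self-contained it returns the parsed pair instead of calling the SDK factories; in the real provider each case would presumably call the matching factory named in the commit message.

```typescript
// Hypothetical sketch of "provider:model" parsing; not the code in agentv-provider.ts.
type SupportedProvider = "openai" | "anthropic" | "azure" | "google";

interface ParsedModel {
  provider: SupportedProvider;
  modelId: string;
}

function parseModelString(spec: string): ParsedModel {
  // Split on the first colon only, so model ids that contain colons survive intact.
  const sep = spec.indexOf(":");
  if (sep <= 0 || sep === spec.length - 1) {
    throw new Error(`Invalid model string "${spec}": expected "provider:model"`);
  }
  const provider = spec.slice(0, sep);
  const modelId = spec.slice(sep + 1);

  switch (provider) {
    case "openai": // would map to createOpenAI
      return { provider: "openai", modelId };
    case "anthropic": // would map to createAnthropic
      return { provider: "anthropic", modelId };
    case "azure": // would map to createAzure
      return { provider: "azure", modelId };
    case "google": // would map to createGoogleGenerativeAI
      return { provider: "google", modelId };
    default:
      throw new Error(`Unsupported provider "${provider}"`);
  }
}

console.log(parseModelString("openai:gpt-5-mini"));
```

An explicit switch keeps each provider's return type independent, which is one way to sidestep the kind of cross-version type mismatch the commit message mentions.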

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove agent-judge as a separate evaluator type. LlmJudgeEvaluator now
auto-detects mode based on the resolved judge provider:
- LLM providers (azure, anthropic, gemini): structured JSON mode
- Agent providers (claude-cli, copilot, etc.): delegate mode
- agentv provider: built-in AI SDK agent mode with filesystem tools

Closes #614
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The transpiler now handles llm-judge with rubrics the same way as
agent-judge, expanding rubric items into individual NL assertion strings.
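
The rubric expansion can be sketched as below. The `RubricItem` shape and the `rubricsToAssertions` helper are assumptions for illustration only; the actual rubric schema and the transpiler's output format are not shown in this PR description.

```typescript
// Hypothetical rubric shape: either a plain string or an object with an `outcome` text.
interface RubricItem {
  outcome: string;
}

// Expand each rubric item into its own natural-language assertion string.
function rubricsToAssertions(rubrics: ReadonlyArray<string | RubricItem>): string[] {
  return rubrics
    .map((item) => (typeof item === "string" ? item : item.outcome))
    .map((text) => text.trim())
    .filter((text) => text.length > 0); // drop empty entries
}

console.log(
  rubricsToAssertions(["Mentions the refund policy", { outcome: "Cites the order ID" }]),
);
// [ "Mentions the refund policy", "Cites the order ID" ]
```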

Part of #614
- Add explicit guard for --model when --judge-target is agentv (was non-null assertion)
- Consolidate evaluateWithJudgeTarget/evaluateWithDelegatedAgent into shared evaluateWithDelegate
- Add try-catch for RegExp construction in search_files tool (prevents crash on invalid patterns)
- Add comments explaining agentv exclusion from AGENT_PROVIDER_KINDS and AgentJudgeSchema backward compat
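
Two of the review fixes above follow common patterns, sketched here under assumptions. `requireModelForAgentv` and `compilePattern` are hypothetical names and the error messages are invented; only the behaviour (an explicit guard instead of a non-null assertion, and a try-catch around RegExp construction) comes from the commit message.

```typescript
// Explicit guard replacing a non-null assertion such as `cliModel!`.
function requireModelForAgentv(
  judgeTarget: string | undefined,
  model: string | undefined,
): string | undefined {
  if (judgeTarget === "agentv" && !model) {
    // Hypothetical message; the PR only states that a proper error is shown.
    throw new Error("--judge-target agentv requires --model, e.g. --model openai:gpt-5-mini");
  }
  return model;
}

type PatternResult = { ok: true; regex: RegExp } | { ok: false; error: string };

// Compile a user-supplied pattern without letting an invalid one crash the tool.
function compilePattern(pattern: string): PatternResult {
  try {
    return { ok: true, regex: new RegExp(pattern) };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, error: `Invalid search pattern: ${message}` };
  }
}

console.log(compilePattern("[a-z]+").ok); // true
console.log(compilePattern("(").ok);      // false
```

Returning a result object from `compilePattern` lets a tool report the bad pattern back to the judge model instead of aborting the evaluation.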

Part of #614

cloudflare-workers-and-pages bot commented Mar 15, 2026

Deploying agentv with Cloudflare Pages

Latest commit: 3c9ed00
Status: ✅  Deploy successful!
Preview URL: https://de46f704.agentv.pages.dev
Branch Preview URL: https://feat-unify-judge-types-614.agentv.pages.dev

christso and others added 4 commits March 15, 2026 14:06
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@christso christso merged commit 228a619 into main Mar 15, 2026
1 check passed
@christso christso deleted the feat/unify-judge-types-614 branch March 15, 2026 19:40

Development

Successfully merging this pull request may close these issues.

Feat: replace llm-judge with agent-judge